perm filename PROP1[7,ALS] blob
sn#032373 filedate 1973-04-04 generic text, type T, neo UTF8
April 3 1973
A Proposal for Speech Understanding Research
It is proposed that the work on speech recognition that is now under way in the A.I.
project at Stanford University be continued and extended as a separate project with
broadened aims in the field of speech understanding. This work gives considerable promise
both of solving some of the immediate problems that beset speech understanding research
and of providing a basis for future advances.
It is further proposed that this work be more closely tied to the ARPA Speech
Understanding Research groups than it has been in the past and that it have as its express
aim the study and application to speech recognition of a machine learning process that has
proved highly successful in another application and that has already been tested out to a
limited extent in speech recognition.  The machine learning process offers both an automatic
training scheme and the inherent ability of the system to adapt to various speakers and
dialects.  Speech recognition via machine learning represents a global approach to the
speech recognition problem and can be incorporated into a wide class of limited-vocabulary
systems.
Finally, we would propose accepting responsibility for keeping other ARPA projects
supplied with operating versions of the best current programs that we have developed. The
availability of the high-quality front end that the signature table approach provides would
enable designers of the various over-all systems to test the relative performance of the
top-down portions of their systems without having to make allowances for the deficiencies
of their currently available front ends.  Indeed, if the signature table scheme can be made
simple enough to compete on a time basis (and we believe that it can), then it may replace
the other front end schemes that are currently in favor.
Stanford University is well suited as the site for such work, having both the facilities
for this work and a staff of people with experience and interest in machine learning,
phonetic analysis, and digital signal processing.
Ultimately we would like to have a system capable of understanding speech from an
unlimited domain of discourse and with unknown speakers. It seems not unreasonable to
expect the system to deal with this situation very much as people do when they adapt their
understanding processes to the speaker's idiosyncrasies during the conversation. The
signature table method gives promise of contributing toward the solution of this problem as
well as being a possible answer to some of the more immediate problems.
The initial thrust of the proposed work would be toward the development of adaptive
learning techniques, using the signature table method and some more recent variants and
extensions of this basic procedure. We have already demonstrated the usefulness of this
method for the initial assignment of significant features to the acoustic signals. One of the
next steps will be to extend the method to include acoustic-phonetic probabilities in the
decision process. Ultimately we would hope to take account of syntactic and semantic
constraints in a somewhat analogous fashion.
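To make the intended decision process concrete, the following is a minimal sketch, not the project's actual program: per-slice phoneme scores (as a signature table might emit) are combined with acoustic-phonetic bigram probabilities by a dynamic-programming search. All phoneme labels, scores, and transition values here are invented for illustration.

```python
import math

PHONES = ["AA", "IY", "S"]

# P(phone | acoustic slice) for three successive slices (invented values).
slice_scores = [
    {"AA": 0.6, "IY": 0.3, "S": 0.1},
    {"AA": 0.2, "IY": 0.5, "S": 0.3},
    {"AA": 0.1, "IY": 0.2, "S": 0.7},
]

# P(next phone | previous phone): acoustic-phonetic constraints (invented).
bigram = {
    ("AA", "AA"): 0.5, ("AA", "IY"): 0.3, ("AA", "S"): 0.2,
    ("IY", "AA"): 0.2, ("IY", "IY"): 0.5, ("IY", "S"): 0.3,
    ("S", "AA"): 0.4, ("S", "IY"): 0.4, ("S", "S"): 0.2,
}

def best_sequence(scores, trans):
    """Viterbi-style search: maximize the sum of log probabilities."""
    # best[p] = (log-prob of best path ending in phone p, that path)
    best = {p: (math.log(scores[0][p]), [p]) for p in PHONES}
    for frame in scores[1:]:
        new = {}
        for p in PHONES:
            # Best predecessor q, weighting acoustic score by bigram score.
            lp, path = max(
                (lp + math.log(trans[(q, p)]), path)
                for q, (lp, path) in best.items()
            )
            new[p] = (lp + math.log(frame[p]), path + [p])
        best = new
    return max(best.values())[1]

print(best_sequence(slice_scores, bigram))   # ['AA', 'IY', 'S']
```

Extending the same search to weight in syntactic and semantic constraints would, in spirit, add further terms to the score being maximized.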
Still another aspect to be studied would be the amount of preprocessing that should
be done and the desired balance between bottom-up and top-down approaches. It is fairly
obvious that decisions of this sort should ideally be made dynamically depending upon the
familiarity of the system with the current domain of discourse and with the characteristics of
the current speaker. Compromises will undoubtedly have to be made in any immediately
realizable system but we should understand better than we now do the limitations on the
system that such compromises impose.
It may be well at this point to describe the general philosophy that has been followed
in the work that is currently under way and the results that have been achieved to date.
We have been studying elements of a speech recognition system that is not dependent upon
the use of a limited vocabulary and that can recognize continuous speech by a number of
different speakers.
Such a system should be able to function successfully either without any previous
training for the specific speaker in question or after a short training session in which the
speaker would be asked to repeat certain phrases designed to train the system on those
phonetic utterances that seemed to depart from the previously learned norm. In either case
it is believed that some automatic or semi-automatic training system should be employed to
acquire the data that is used for the identification of the phonetic information in the speech.
We believe that this can best be done by employing a modification of the signature table
scheme previously described. A brief review of this earlier form of signature table is given
in Appendix 1.
The over-all system is envisioned as one in which the more or less conventional
method is used of separating the input speech into short time slices for which some sort of
frequency analysis, homomorphic, LPC, or the like, is done.  We then interpret this
information in terms of significant features by means of a set of signature tables.  At this
point we define longer sections of the speech called EVENTS which are obtained by grouping
together varying numbers of the original slices on the basis of their similarity.  This then
takes the place of other forms of initial segmentation.  Having identified a series of EVENTS
in this way we next use another set of signature tables to extract information from the
sequence of events and combine it with a limited amount of syntactic and semantic
information to define a sequence of phonemes.
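The slice-then-group idea can be sketched as follows. This is an illustrative toy, not the project's code: the signal is cut into fixed-length time slices, a crude feature is computed per slice, and runs of adjacent slices with similar features are merged into EVENTS. The feature here (mean absolute amplitude) and the similarity threshold are invented stand-ins for the spectral features the proposal describes.

```python
def slice_signal(samples, width):
    """Cut the sample stream into fixed-width time slices."""
    return [samples[i:i + width] for i in range(0, len(samples), width)]

def feature(sl):
    """Toy per-slice feature: mean absolute amplitude."""
    return sum(abs(x) for x in sl) / len(sl)

def group_events(slices, threshold):
    """Merge adjacent slices whose features are similar into events,
    replacing other forms of initial segmentation."""
    events = []
    for sl in slices:
        f = feature(sl)
        if events and abs(f - events[-1]["feature"]) < threshold:
            ev = events[-1]
            n = ev["count"]
            ev["feature"] = (ev["feature"] * n + f) / (n + 1)  # running mean
            ev["count"] = n + 1
        else:
            events.append({"feature": f, "count": 1})
    return events

signal = [0, 1, 0, -1] * 4 + [5, 6, -5, -6] * 4      # quiet run, then loud run
events = group_events(slice_signal(signal, 4), threshold=1.0)
print([(e["feature"], e["count"]) for e in events])  # [(0.5, 4), (5.5, 4)]
```

Each resulting event (a feature value plus a duration in slices) would then be handed to the second bank of signature tables.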
While it would be possible to extend this bottom-up approach still further, it seems
reasonable to break off at this point and revert to a top-down approach from here on. The
real difference in the overall system would then be that the top-down analysis would deal
with the outputs from the signature table section as its primitives rather than with the
outputs from the initial measurements either in the time domain or in the frequency domain.
In the case of inconsistencies the system could either refer to the second choices retained
within the signature tables or, if need be, could always go clear back to the input parameters.
The decision as to how far to carry the initial bottom-up analysis must depend upon the
relative cost of this analysis, both in complexity and processing time, and the certainty with
which it can be performed, as compared with the costs associated with the rest of the
analysis and the certainty with which it can be performed, taking due notice of the costs in
time of recovering from false starts.
Signature tables can be used to perform four essential functions that are required in
the automatic recognition of speech. These functions are: (1) the elimination of superfluous
and redundant information from the acoustic input stream, (2) the transformation of the
remaining information from one coordinate system to a more phonetically meaningful
coordinate system, (3) the mixing of acoustically derived data with syntactic, semantic and
linguistic information to obtain the desired recognition, and (4) the introduction of a learning
mechanism.
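A hedged sketch of one possible signature-table cell scheme follows; the actual tables are described in Appendix 1 and Appendix 2, and the quantization boundaries, input terms, and phonetic categories below are invented for illustration. Each input term is quantized to a few levels, the quantized tuple indexes a table entry, and training simply accumulates per-category counts in that entry.

```python
from collections import defaultdict

class SignatureTable:
    def __init__(self, boundaries):
        self.boundaries = boundaries          # one boundary list per input term
        self.counts = defaultdict(lambda: defaultdict(int))

    def quantize(self, values):
        """Map each raw input term to a small integer level.

        Quantizing discards superfluous detail; the table lookup then
        re-expresses the inputs in a more phonetically meaningful space."""
        return tuple(
            sum(v > b for b in bounds)
            for v, bounds in zip(values, self.boundaries)
        )

    def train(self, values, category):
        # Training is just accumulating counts, sample by sample: no
        # simultaneous storage of all training samples is needed.
        self.counts[self.quantize(values)][category] += 1

    def classify(self, values):
        cell = self.counts[self.quantize(values)]
        if not cell:
            return None
        return max(cell, key=cell.get)        # most frequent category in cell

# Two invented input terms, each quantized to 3 levels.
table = SignatureTable(boundaries=[[0.3, 0.6], [0.3, 0.6]])
table.train([0.8, 0.1], "fricative")
table.train([0.9, 0.2], "fricative")
table.train([0.1, 0.7], "vowel")
print(table.classify([0.85, 0.15]))   # falls in the "fricative" cell
```

Because any combination of quantized levels gets its own cell, arbitrary inter-relationships among the input terms are captured by a single table, and storage grows with the table size rather than with the number of training samples.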
The following three advantages emerge from this method of training and evaluation.
1) Essentially arbitrary inter-relationships between the input terms are taken into
account by any one table. The only loss of accuracy is in the quantization.
2) The training is a very simple process of accumulating counts. The training samples
are introduced sequentially, and hence simultaneous storage of all the samples is not
required.
3) The process linearizes the storage requirements in the parameter space.
The signature tables, as used in speech recognition, must be particularized to allow
for the multi-category nature of the output. Several forms of tables have been investigated.
Details of the current system are given in Appendix 2. Some results are summarized in an
attached report.
Work is currently under way on a major refinement of the signature table approach
which adopts a somewhat more rigorous procedure. Preliminary results with this scheme
indicate that a substantial improvement has been achieved.